Prosper Loan Data analysis by Abdelkarim Aydi

Data set overview: The analysed data set in this docuemnt is Loan Data from Prosper Bank. It consists in a data base of 113,937 loans with 81 variables on each loan, including among other variables: loan amount, borrower income, borrower employment status, borrower credit history and credit score which assesses credit risk for customers.

## [1] 113937     81

Selecting intresting features: Given the high number of features, I filtred out features with missing values ratio exceeding 10%.

## [1] "Pre-selected features that have less than 10 % missing values"
##                               feature      ratio
## 1                   DebtToIncomeRatio 7.50765774
## 2            EmploymentStatusDuration 6.69229486
## 3                    AmountDelinquent 6.68966183
## 4                  CurrentCreditLines 6.67386363
## 5                     OpenCreditLines 6.67386363
## 6           PublicRecordsLast12Months 6.67386363
## 7              RevolvingCreditBalance 6.67386363
## 8                 BankcardUtilization 6.67386363
## 9             AvailableBankcardCredit 6.62120295
## 10                        TotalTrades 6.62120295
## 11 TradesNeverDelinquent..percentage. 6.62120295
## 12            TradesOpenedLast6Months 6.62120295
## 13                     TotalInquiries 1.01722882
## 14            DelinquenciesLast7Years 0.86890123
## 15         TotalCreditLinespast7years 0.61174158
## 16               InquiriesLast6Months 0.61174158
## 17               CurrentDelinquencies 0.61174158
## 18           PublicRecordsLast10Years 0.61174158
## 19              CreditScoreRangeLower 0.51870771
## 20              CreditScoreRangeUpper 0.51870771
## 21                        BorrowerAPR 0.02194195
## 22                         ListingKey 0.00000000
## 23                      ListingNumber 0.00000000
## 24                ListingCreationDate 0.00000000
## 25                        CreditGrade 0.00000000
## 26                               Term 0.00000000
## 27                         LoanStatus 0.00000000
## 28                         ClosedDate 0.00000000
## 29                       BorrowerRate 0.00000000
## 30                        LenderYield 0.00000000
## 31              ProsperRating..Alpha. 0.00000000
## 32          ListingCategory..numeric. 0.00000000
## 33                      BorrowerState 0.00000000
## 34                         Occupation 0.00000000
## 35                   EmploymentStatus 0.00000000
## 36                IsBorrowerHomeowner 0.00000000
## 37                   CurrentlyInGroup 0.00000000
## 38                           GroupKey 0.00000000
## 39                   DateCreditPulled 0.00000000
## 40            FirstRecordedCreditLine 0.00000000
## 41              OpenRevolvingAccounts 0.00000000
## 42        OpenRevolvingMonthlyPayment 0.00000000
## 43                        IncomeRange 0.00000000
## 44                   IncomeVerifiable 0.00000000
## 45                StatedMonthlyIncome 0.00000000
## 46                            LoanKey 0.00000000
## 47          LoanCurrentDaysDelinquent 0.00000000
## 48         LoanMonthsSinceOrigination 0.00000000
## 49                         LoanNumber 0.00000000
## 50                 LoanOriginalAmount 0.00000000
## 51                LoanOriginationDate 0.00000000
## 52             LoanOriginationQuarter 0.00000000
## 53                          MemberKey 0.00000000
## 54                 MonthlyLoanPayment 0.00000000
## 55                LP_CustomerPayments 0.00000000
## 56       LP_CustomerPrincipalPayments 0.00000000
## 57                 LP_InterestandFees 0.00000000
## 58                     LP_ServiceFees 0.00000000
## 59                  LP_CollectionFees 0.00000000
## 60              LP_GrossPrincipalLoss 0.00000000
## 61                LP_NetPrincipalLoss 0.00000000
## 62    LP_NonPrincipalRecoverypayments 0.00000000
## 63                      PercentFunded 0.00000000
## 64                    Recommendations 0.00000000
## 65         InvestmentFromFriendsCount 0.00000000
## 66        InvestmentFromFriendsAmount 0.00000000
## 67                          Investors 0.00000000
## [1] "Selecting the most appropriate features"
##  [1] "CreditGrade"                "Term"                      
##  [3] "LoanStatus"                 "BorrowerRate"              
##  [5] "BorrowerState"              "Occupation"                
##  [7] "EmploymentStatus"           "IsBorrowerHomeowner"       
##  [9] "TotalCreditLinespast7years" "DelinquenciesLast7Years"   
## [11] "PublicRecordsLast10Years"   "IncomeVerifiable"          
## [13] "StatedMonthlyIncome"        "Recommendations"           
## [15] "ProsperRating..Alpha."      "LoanOriginalAmount"

Univariate Plots Section

## [1] 113937     16

Our data set consists in 113937 load records described with 16 features detailed as follows:

## 'data.frame':    113937 obs. of  16 variables:
##  $ CreditGrade               : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                      : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerRate              : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ BorrowerState             : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus          : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ IsBorrowerHomeowner       : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ TotalCreditLinespast7years: int  12 29 3 29 49 49 20 10 32 32 ...
##  $ DelinquenciesLast7Years   : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years  : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ IncomeVerifiable          : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome       : num  3083 6125 2083 2875 9583 ...
##  $ Recommendations           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProsperRating..Alpha.     : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ LoanOriginalAmount        : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...

The data set consists in 8 categorical features and 8 numeric ones. For categorical feature I am plotting count plots and for numeric ones I am plotting histograms. Based on str output, I determined categorical features and numeric ones as follows:

Remark : For categorical features, missing values are sometimes encoded as “”. So they have to be handled properly. In fact, the meta data set, shows that CreditGrade is available for loans issued before 2009 and ProsperRating_alpha deals with loans issued after July 2009. So one good pre-processing step is to apporpriately blend these two risk metrics.

The vast majority of customers apply for loan without any recommandations. Another intresting aspect, is the long tail of distributions i.e some applications have an oustanding number of recommendations. I suspect this loans to be of relatively high amount and thus need more warranties to be granted, which I will check in section bi-variate section.

There are 3 loan term options: (12, 36 or 60 months). This numeric variable can be viewed as categorical ordinal variable (short term, mid term and long term). Moreover, the mode is situated at 36 months (3 years), followed by 5 year terms and then short term loans. Naturally, the term is corrolated with the loan ammount, which I will check in the next bi-variate section.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

This figure shows the distribution of borrower intrest rates. These rates range from 5 % up to 36 % with the mode situated at 32.5 %. Moreover, the distribution is very skewed to the left relative to the mode.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   17.00   25.00   26.75   35.00  136.00     697

This plot shows the number of credit applications in the past seven years at the time the credit profile was pulled for each customer. On average, applicants have a history of 26.75 Credit applications issued in the last 7 years. Given the above remark about term, some applicats must have issued multiple applications on the same time.

This plot shows the number of delinquencies in the past 7 years at the time the credit profile was pulled.The vast majority of credit applicant are non-delinquent. But given ,the distribution has a long tail to the right, some applications are manifesting a consistant delinquent behaviour, and I suspect their CreditGrade should reflect their high risk and perhaps declined application.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

This chart shows the distribution of monthly incomes for borrowers. This distribution is highly skewed to the right with incomes that go up to 1.5M$/month !

These values are extremely high relative to population median of 4667$/month. So, one reason I suspect is this stated incomes are fake and unjustificable otherwise they logically don’t need to borrow money at all !!! To check the distribution more precisely, I truncate the incomes at 10K$/month, and plot seperately the two distributions. The second distribution is plotted in log scale x-axis.

## # A tibble: 2 x 2
##   is_income_low median
##   <lgl>          <dbl>
## 1 FALSE         12500 
## 2 TRUE           4417.

The monthly income below 10K$ is distributed in a bell-like curve with a median of 4417 $/month. One can note a significant fraction of the applicants have no income at all. This extreme precarious population, I suspect, is applying for loans to accomodate daily needs so the ammount of the loan should be low and also rated as highly risky.

The high income applicants, have monthly income ranging from 10K$ to 1.5M$ with a median value of 12500. I prefered using the median because it is more robust than the mean to outliers. But one intriguing idea I find, why these extreme wealthy people applying for loans ?

This plot shows the distribution of applicants public records for the last 10 years. In USA, these records may contain these useful informations: - Records of law court proceedings - Mug shot - Marriage records - Death records The vast majority have empty public records but as above this distribution is heavy tailed and some applicants have large records which could correlate with a delinquent behaviour and gives a hint about Prosper decision.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

All issued loans are between 1K$ and 35K$ with a mean of 8.4K$.

Remarks: The Credit rating that was assigned at the time the listing went live. Applicable for loans before 2009.The grades are ordered as follows from high risk to low risk: 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA The “” denotes a missing values 0 - N/A. These missing values can be imputed using the ProsperRating_Alpha

The risk metric aggregation splits applicants into 7 categories as mentionend above HR (high risk) -> AA (low risk). The studied data set shows that the risk metric mode is C which stands for average risk and the smallest category is AA whose applicants loans are more likely to be granted.

The applicants are distributed across 51 states. The state that issues the most applications is California, followed by Texas and New York.

The plot shows that many applicants don’t provide their occupation. Computer programmers are among the most freqeunt applicants followed by executives and Teachers.

The vast majority of applicants are in full-time jobs.

The great majority of applicants have verifiable incomes. But what is the effect of income verificiation on credit acceptance ?

The applicants population is roughly balanced with respect to Home ownership. But still, is this piece of information useful for granting loans ?

The majority of loan record are either current or completed. But, from the perspective of bank, all other categories are worth of intrest because it represent direct loss if not repaid.

Univariate Analysis

What is the structure of your dataset?

The studied data set consists of sampled and pre-processed version of the Prosper loan data set. It contains 113665 loan records and 15 features as follows:

  • Categorical features
  • risk_metric (engineered feature)
  • LoanStatus Current - completed - charged off - canceld - due - …
  • BorrowerState (all USA states)
  • Occupation (see above)
  • EmploymentStatus (see above)
  • IsBorrowerHomeowner (boolean)
  • IncomeVerifiable (boolean)

  • numerical features
  • Term 12 36 or 60 months
  • BorrowerRate (%) 5% -> 35%
  • TotalCreditLinespast7years: 2 -> 136
  • DelinquenciesLast7Years: 0 -> 99
  • PublicRecordsLast10Years: 0 -> 38
  • Recommendations 0 -> 40
  • LoanOriginalAmount 1K$ -> 35K$

What is/are the main feature(s) of interest in your dataset?

The engineered risk metric is the main feature of intrest because it determines the outcome of loan application process.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think the intrest rate, income and delinquency records and job status are potential informative features.

Did you create any new variables from existing variables in the dataset?

Yes, risk_metric, it is a blended version of two risk scores that are issued for different periods

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I think the StatedIncomeRate is very skewed feature with very high and unrealistic incomes for a merly 35K$ loan at best cases !!! Perhaps, I would remove the top percentile capping stated incomes at 20K$.

Bivariate Plots Section

The previous 2 charts show the correlation strcuture of numeric features.

The first chart is computed with a subset of data. To investigate all correlations, the second chart is computed on the whole data set.The covariance matrix of the numeric features and unveils many aspects of the data set. In order of correlation strength:

In the first section, I formulated some questions to answer by inspecting bivariate plots. The studied relationships are:

There is no clear relashionship between the number of recommendations and loan amount. But the statistical test shows that recommandation is statistically significant on loan amount: higher recommendation -> lower amount.

## 
## Call:
## lm(formula = amount ~ recom)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7366  -4366  -1866   3634  30027 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8365.76      18.72 446.928  < 2e-16 ***
## recom        -343.41      55.70  -6.165 7.07e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6246 on 113663 degrees of freedom
## Multiple R-squared:  0.0003343,  Adjusted R-squared:  0.0003255 
## F-statistic: 38.01 on 1 and 113663 DF,  p-value: 7.075e-10

There is no visual effect of income on loan amount.

Borrowers applying for high loan amounts tend to choose longer payment terms.

Loan applicants tend to decrease loan amount when faced with high intrest rates.

These figures are very informative of risk scoring as suspected. For moderate monthly income, applicants with higher incomes are considered less risky and are likely to get their loan accepted.However,it seems that income is not a significant factor for high income applicants.

## 
##  Pearson's Chi-squared test
## 
## data:  risk and income_verif
## X-squared = 607.5, df = 6, p-value < 2.2e-16

The plot and Chi-Square statistical test proves with high confidence that applicants with unverifiable income are scored differently than the ones with some income evidence.

## 
##  Pearson's Chi-squared test
## 
## data:  risk and is_home_owner
## X-squared = 4653.4, df = 6, p-value < 2.2e-16

The plot and Chi-Square statistical test proves with high confidence that applicants who own their home are scored differently than the ones with some income evidence and more likely to get their loan accepted.

## 
##  Pearson's Chi-squared test
## 
## data:  risk and status
## X-squared = 5201, df = 48, p-value < 2.2e-16

The plot and Chi-Square statistical test proves with high confidence that applicants employment status is an influent factor on risk score.

## 
##  Pearson's Chi-squared test
## 
## data:  risk and status
## X-squared = 9680.4, df = 66, p-value < 2.2e-16

The plot and Chi-Square statistical test proves with high confidence that applicants loan status is an informative factor on risk score.

This plot exhibits a very strong positive correlation between credit risk and loan intrest rates. Very low risk applcants get their loans granted with minimal rates (10%), and vice versa.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The feature of intrest risk_metric was found to be highly correlated with borrower rate, monthly income only for moderate incomes, and wether or not it is verifiable, employement status, home ownership.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Borrower rate and monthly income are highly negatively correlated: it seems that applicants adapt their loan amount to the intrest rate which determined by the credit score, once all negociations are finished.

Moreover there is a strong evidence that longer terms are associated with higher loan amount.

What was the strongest relationship you found?

The risk metric and borrower rate are highly correlated.

Multivariate Plots Section

It turns out that extremely high incomes (> 10K$/month) are not justifiable and curiously almost all no income applicants are unable to justify it either. Let’s get a look at risk metric for this same figures.

This figure gives some insights about the interactions between loan amount, applicants monthly income and risk metric:

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Multivariate plots enabled me to investigate some curious questions.

For example, my initial question was: What is the point of borrowing money high-income borrowers. The answer is that these applicants are pretending very high salary without any evidence. As consquence, their risk score is high and are likely to get rejected.

Moreover, I discovered that the low amount loans are likey to be rejected.

Were there any interesting or surprising interactions between features?

The last figure shows many intresting insights: For example, applicants with few delinquencies cases tend to apply for higher loans amount and get a decent risk score. However, many applicants have applied for several loans during the last years and if it happens that defaulted or not payed it back tend to apply for lower amount for the next times and as a consequence the bank tend to class them as high risk candidates.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I did not create statisctical models.

Final Plots and Summary

Plot One

Description One

This figure shows: - A boxplot of loan amounts across risk metrics: High risk applicants tend to diminish their loan amount

  • A scatter plot of number of delinquencies across risk metrics: Low loan amounts are correlated with high number of delinquencies.

Plot Two

Description Two

This figure shows a boxplot of borrower rates for different risk profiles: High risk applicants tend to have heigher rates to compensate for it.

Plot Three

Description Three

The figure shows the most 3 important factors on Prosper credit score: number of delinquencies, monthly income and loan amount.


Reflection

Insights: The analysis of Prosper Loan Data set lead me to a number of insights: - Loan amount, Delinquencies history and monthly income are the most informative features about credit risk. - Loan application is a dynamic process: borrower change and adapt to bank constrains given his/her history delinquencies, previous loan decisions … and generally loan amounts decreases over time as risk of default goes up.

Struggles: As I worked through the project, I faced some challenges: - High number of features: the step of feature pre-selection among ~100 features. - High number of missing values especially for key features such as Prosper credit scores - Presence of anomalies presence in some important features such as income with very different range form 0 -> 1.5M$ !

Surprises: As I worked through the project, the exploration surprised me :

Future work: - This study can be carried on to offer an online tool for loan simulation. Based on the above insight, one can start from the 3 most informative featurs to build a model of credit risk.